Finding phylogeny-aware and biologically meaningful averages of metagenomic samples: L2UniFrac

Abstract Motivation Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be improved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. Results To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific “representative samples.” We demonstrate the usefulness of such representative samples as well as the extended usage of L2UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs to the desired properties of L2UniFrac. Availability and implementation A prototype implementation is provided at https://github.com/KoslickiLab/L2-UniFrac.git. All figures, data, and analysis can be reproduced at https://github.com/KoslickiLab/L2-UniFrac-Paper


Background and context
Microbes play a vital part in many aspects of our lives, and many diseases and conditions are related to the microbiota in and on our body. Examples of which can include various kinds of cancers (Banerjee et al. 2015;Yachida et al. 2019), as well as in behavioral conditions such as sociability (Johnson et al. 2022) or Autism Spectrum Disorder (Kang et al. 2017). Despite holding the key to many of the known problems, metagenomic studies are faced with multiple challenges. One of these difficulties lies in the high variability in metagenomic samples (Mocali and Benedetti 2010;Nunan et al. 2002). Since samples for a given environment collected at different subenvironments or at different time points can be vastly different in microbial distribution, it can be difficult to characterize a specific environment by an average or representative distribution.
In this paper, we will present an approach to create such average or representative distributions of a collection of metagenomes by defining a new modification of the classic UniFrac metric (Lozupone and Knight 2005;McClelland and Koslicki 2018) which we call the L 2 UniFrac. We will look at applications of these representative samples, as well as further providing mathematical characterizations of the L 2 UniFrac metric by giving formal mathematical proofs of its properties.
The code used to generate all data and figures in this manuscript can be found in https://github.com/KoslickiLab/L2-UniFrac-Paper.git and a prototype implementation for L 2 UniFrac can be found in https://github.com/KoslickiLab/ L2-UniFrac.git.

Rationale of approach
In a typical environment-related metagenomics study, clustering is often performed after sample collection and data analysis, the result of which are visualized and analysed by methods such as Principal Coordinate Analysis (PCoA). It is therefore natural that such clusters can be used as a basis or guide in describing an environment, or in finding a representative distribution that characterizes the environment. Intuitively, it is natural that given a large sample size, the centroid (also called the average or the barycenter) of a clustering of these samples with respect to some metric can be used as a representative of the samples in this environment. With such an average, it will be much easier to distinguish the signature microbiomes unique to the environment from random noise that come from microbes rarely in the samples. It will also make the comparison between two environments much easier, simplifying it to the comparison between two sample averages. Furthermore, the inherent variability of metagenomic samples (even those taken from the same or similar environments) can confound analyses and possibly introduce noise in the form of low abundance false positive organisms. By taking an average, such disadvantages can be ameliorated.
Importantly, the choice of metric with which an average is calculated should be carefully chosen in order to gain the desired insight on such characterizations of the environment.
Due to their ubiquity, we focus herein on metrics defined on taxonomic or phylogenetic trees that also have biological relevancy, hence our focus on the UniFrac Metric. Other metrics or similarity measures (such as Jensen-Shannon divergence, Bray-Curtis dissimilarity, simple component-wise Euclidean, or linear algebraic norms, etc.) may be appropriate to use for when analyzing metagenomic data, but all these lack phylogenetic and/or taxonomic awareness so ignore that intrinsic aspect of metagenomic samples.
Finding the barycenter of a group of distributions with respect to an arbitrary metric is not as trivial as it seems. For the purpose of scalability and practicality, the method of computing such a barycenter should be relatively easy and the result easily interpretable biologically. The key to both of these properties lies in the choice of distance metric. A clustering process like this requires a distance metric such that the dissimilarity between two microbial distributions, a property also known as the betadiversity, can be measured. There are multiple such betadiversity metrics. In this paper, we will focus on the discussion of finding the barycenter with respect to the UniFrac metric.
There are two main reasons why the UniFrac metric was chosen among the others. The first is due to its robustness. Being one of the most widely used phylogenetic metrics, the UniFrac metric has demonstrated its usefulness in many areas ranging across clinical studies (Liang et al. 2017), environmental studies (Moura et al. 2018), and forensic science (Wang et al. 2020). Compared to nonphylogenetic metrics such as the Jaccard index and the Bray-Curtis dissimilarity index, the UniFrac metric takes into account the phylogenetic relationships among the organisms, giving it more biological context and hence more interpretability. In real applications, the UniFrac metric has also demonstrated its superiority over other methods, such as in sensitivity in the identification of enterotypes, which are subtypes of the same environment (Liang et al. 2017), making it an ideal candidate. The second reason is due to the unique and interesting mathematical characterization of UniFrac, which makes this biologically motivated problem mathematically interesting at the same time, and leaves rooms for mathematical generalization that could potentially be useful for other related problems yet to be explored.
We begin by first understanding this mathematical characterization and how it is related to our goal of finding the barycenter with respect to the UniFrac distance. About 7 years after the development of the original UniFrac in 2005 (Lozupone and Knight 2005), it was shown by Evans and Matsen (2012) that the UniFrac metric is closely related to a mathematical problem that had been studied for centuries: the optimal transport problem. The optimal transport problem uses a distance known as the Wasserstein distance, or the Earth Mover's distance. Evans and Matsen (2012) showed that the UniFrac distance is equivalent to the one-Wasserstein distance over a phylogenetic tree. Under this formulation, instead of the original computation of UniFrac as a fraction of branch lengths, it can be equivalently computed by representing metagenomic samples by probability distributions and "aggregating" the distributions up a phylogenetic tree through matrix multiplication, obtaining the aggregated vectors in what we will call the "L 1 UniFrac space," followed by taking the L 1 norm between two such aggregated vectors. This alternative characterization of UniFrac, termed the "L 1 UniFrac" by McClelland (2018), has several benefits. First, it allows an alternative representation of a metagenomic sample as an aggregated vector in the L 1 UniFrac space, in which the distance between two vectors can be easily computed by taking the L 1 -norm. Also, this representation of data points in the L 1 UniFrac space leaves room for computation of the L 1 -mean, or median, among a group of samples, giving rise to an easy-to-compute centroid, which can be viewed as the average of the samples in the L 1 UniFrac space. This idea of computing the "L 1 -mean UniFrac" was first mathematically conceptualized by McClelland (2018), who saw it as an opportunity to find the representative sample of an environment with respect to the UniFrac metric.
However, the "L 1 -mean sample" computed using this method does have a detrimental issue. We want the mean/average to have a readily interpretable biological meaning of representing an average sample in the environment, and not simply as an aggregated vector in the L 1 UniFrac space. The former requires it being a distribution vector with each entry being non-negative and all the entries sum up to 1, such that each entry represents the relative abundance of an organism in the sample. This desired property cannot be guaranteed when the mean vector in the L 1 UniFrac space is projected back to the distribution space. Both experiments using real world data and mathematical deductions have shown that the resulting projected vector could contain negative entries (Millward 2022). This makes the result biologically meaningless since it does not make sense to have negative abundance of certain organisms or clades. This means that no matter how robust the traditional L 1 UniFrac is, it is not suitable as a metric with which we compute a biologically meaningful barycenter. Our proposed solution is an alternative characterization of UniFrac using the L 2norm instead of L 1 . Without changing the nature of being a solution to an optimal transport problem, the so-termed L 2 UniFrac retains the original spirit of UniFrac of being a phylogeny-aware dissimilarity metric and offers comparable robustness, while giving us the unique advantage over L 1 UniFrac of producing biologically meaningful barycenter without negative entries with very simple computation.

Materials and methods
Herein, we describe the L 2 UniFrac metric and how to use it to take averages. Full details and proofs are provided in the Supplementary Material.

L 1 and L 2 UniFrac
Given a phylogenetic tree T with N ordered nodes, a metagenomic sample can be represented as a probability distribution P ¼ ðP 1 ; P 2 ; . . . ; P N Þ with entries in the same order as the nodes of T, such that P i represents the relative abundance of organism/taxa i in sample P. Define w j ðiÞ ¼ 1 if i is a node on the subtree of T rooted at node j 0 otherwise: (1) Let W be an N Â N matrix with row j being w j scaled by length of the branch connecting node j and its ancestor. In essence, W is a matrix representation of the phylogeny/structure of T. The matrix W ffi : p is defined similarly, but with the rows scaled by the square root of the branch length.
We can now recall the definition of the UniFrac metric [see, for example, and of Evans and Matsen (2012)  Wei et al.
DEFINITION 1 (UniFrac). For two metagenomic samples represented by the probability distributions P and Q, and with W as defined above for some fixed phylogenetic tree, and for jj Á jj 1 the standard L 1 norm, UniFracðP; QÞ ¼ jjWðP À QÞjj 1 : By simply changing the norm from the L 1 to L 2 norm, we obtain: DEFINITION 2 (L 2 UniFrac). For two metagenomic samples represented by the probability distributions P and Q, and with W ffi Á p as defined above for some fixed phylogenetic tree, and for jj Á jj 2 the standard L 2 norm, In the Supplementary Material, we demonstrate that one can take meaningful averages with respect to the L 2 UniFrac metric, but not with the traditional (L 1 ) UniFrac.
In practice though, we do not need to form the large matrix W ffi : p , but rather proceed in a post-order aggregation to implement the matrix multiplications W ffi : p P and W ffi : p Q in Equation (3) in Definition 2 (see Algorithm 1). After applying Algorithm 1 to compute W ffi : p P and W ffi : p Q, the L 2 UniFrac can then be computed by taking the simple L 2 -norm of the difference of the resulting vectors.
Later, we will need to also compute the inverse of this operation (i.e. compute W À1 ffi : p P and W À1 ffi : p Q), the algorithm for which is similar to the above (see Algorithm 2).
The abovementioned differences in computation between L 1 UniFrac and L 2 UniFrac do, as expected, give us different values for the same samples. However, they reflect the same trend with high correlation (See Fig. 1). In other words, we expect that the insights that can be obtained using L 1 UniFrac can be obtained using L 2 UniFrac just as well.
To illustrate this, we computed both pairwise L 1 UniFrac and L 2 UniFrac using the same data [study ID 714 from Qiita (Gonzalez et al. 2018)] and present the correlation between the L 1 UniFrac and L 2 UniFrac distance for each sample pair in Fig. 1, as well as computing their correlation coefficient.

Computing L 2 UniFrac averages
In the previous section, we showed the differences in the computation of L 1 UniFrac and L 2 UniFrac and demonstrated the similarity in their performance. In this section, we will demonstrate the unique advantage of L 2 UniFrac in computing the average sample, which cannot be achieved using L 1 UniFrac. In the Supplementary Material, we demonstrate that one cannot form biologically meaningful (L 1 ) UniFrac averages of a collection of metagenomic samples represented by a collection of probability vectors. In short, this is due to the probability simplex not being closed under the operation of taking medians. We show this in two ways: first we include in the Supplementary Section S1.4.1 a proof by counterexample that averages with respect to L 1 UniFrac result in uninterpretable "negative abundances." Secondly, this issue does not only occur in carefully constructed theoretical examples, but is actually prevalent in real world cases. We demonstrate this by computing the representative samples for each of the body sites in the Qiita Gonzalez et al. (2018) (Study ID 1928) using L 1 UniFrac and counted the number of negative values present Algorithm 1. L 2 -aggregate: an algorithm to obtain the L 2 -aggregated vector given a probability vector P, resulting in a vector in L 2 UniFrac space.
1: Input: 2: P, T, where P is a probability vector with entries representing relative abundances summing up to 1, T being the tree with the ancestor of a node i denoted by a(i) and the branch length between i and a(i) denoted by l(i). 3: Initialization: P ¼ P 4: for i in 1, . . ., jT j À 1 do " Ordered from the leaves to the root (i.e. post-order) 5: P ½aðiÞþ ¼ P ½i " aggregating mass up 6: P ½i ¼ P ½i Á ffiffiffiffiffiffiffi lðiÞ p 7: end for 8: return P Algorithm 2. Inverse-aggregate: an algorithm that reverse L 2 -aggregate to obtain a probability vector in the original space, given an aggregated vector in the L 2 UniFrac space.
1: Input: 2: P , T, where P is a probability vector in the L 2 UniFrac space, T being the tree with the ancestor of a node i denoted by a(i) and the branch length between i and a(i) denoted by l(i). 3: Initialization: P ¼ P 4: for i in 1, . . ., jT j À 1 do " Ordered from the leaves to the root (i.e. post-order) 5: v ¼ P ½i 6: P ½aðiÞÀ ¼ 1 ffiffiffiffiffi lðiÞ p Ã v 7: end for 8: return P Finding phylogeny-aware and biologically meaningful averages of metagenomic samples i59 in each of the average sample vectors and summarized the data in Table 1. As shown in the table, negative values appeared without exception for all representative samples/averages obtained using L 1 UniFrac, and the number of negative values easily go up to above 200. This demonstrates that averages cannot be taken with respect to the L 1 UniFrac metric. However, the probability simplex is closed under means (see Supplementary Section S1.4), which implies that averages can be taken with L 2 UniFrac. As such, we describe taking averages with respect to the L 2 UniFrac metric only.
The definition of an average (or more precisely, a barycenter) with respect to the L 2 UniFrac metric is as follows: DEFINITION 3 (L 2 UniFrac barycenter). Given a collection of probability distributions fP i g M i¼1 representing metagenomic samples, each of which is given by an N-length vector indexed by the nodes in a fixed phylogenetic tree T, the average, or barycenter, of the set fP i g M i¼1 is given by P Ã : where x is any probability distribution in the simplex on which the P i 's reside. Thus, in practice, to compute a "representative sample" of a collection of vectors fP i g M i¼1 , we proceed as follows: first, form fW ffi : p P i g M i¼1 by repeat application of Algorithm 1. Then, form the L 2 UniFrac barycenter P Ã by taking the componentwise mean of the collection fW ffi : Finally, use Algorithm 2 to compute W À1 ffi : p P Ã , the result of which will be a probability distribution representing the L 2 UniFrac average of the collection fP i g M i¼1 . Most interestingly, we show in the Supplementary Material, Claim 4 that the result of this process is as simple as taking the component-wise mean of the collection of vectors fP i g M i¼1 . For two collections of vectors (representing, say, two different environments) fP i g M 1 i¼1 and fQ i g M 2 i¼1 , the "representative sample" vectors W À1 ffi : p P Ã and W À1 ffi : p Q Ã can be used as a proxy of their respective collections. Thus, instead of needing to form all pairwise L 2 UniFrac distance calculations as: and averaging over the environments, we can much more efficiently do the single computation: This amounts to applying Algorithm 1 to the componentwise averages of fP i g M 1 i¼1 and fQ i g M 2 i¼1 , then taking the L 2norm of the difference between these two vectors.
In the following, we will refer to "L p UniFrac." By this we mean the following: for a sample distribution P (living in the probability simplex), left-multiplying by the aforementioned W (modified depending on the L p space under consideration), we obtain a vector WP living in "L p UniFrac space".

Clustering in L 2 UniFrac space
We hypothesize that phylogeny-aware clustering of samples, which is a common procedure in metagenomic analyses, can  Figure 2. Comparison between the performance of the traditional matrixbased clustering, where pairwise UniFrac distance matrix was computed first and used as the basis for clustering, versus L 2 -clustering, where data points were directly clustered in the L 2 UniFrac space using L 2 norm as a distance metric. Left: running time. Right: clustering quality measured using Fowlkes-Mallows score, with a higher score indicating a better clustering. i60 Wei et al.
be done much faster in the L p UniFrac space than in the distribution space. The reason is as follows: the traditional way of UniFrac-based clustering (in the distribution space) requires the pairwise UniFrac distances be computed prior to applying clustering methods. The computation of UniFrac is not trivial. Pairwise comparison aggravates the cost of computation by an exponential factor as sample size increases. On the other hand, when samples are represented as aggregated vectors in the L p UniFrac space, the UniFrac distance can be readily computed by simply computing the L p -norm. The bulk of the computation lies only in transforming P to WP through matrix multiplication, which is linear with respect to sample size. We tested this hypothesis using L 2 UniFrac as a representative of the L p UniFrac general case. To demonstrate the improvement on clustering speed, we randomly selected 1000 samples out of the 6067 samples of 16S data obtained from Qiita (Gonzalez et al. 2018) (Study ID 1928, consisting of samples collected from four body sites: skin, saliva, vagina, and feces. Out of these samples, we randomly sampled with sample size ranging from 50 to 800 in step of 50. Each of these samples were clustered using two methods: the conventional matrixbased method, in which a pairwise UniFrac distance matrix is computed using the EMDUniFrac (McClelland and Koslicki 2018) algorithm, followed by k-medoids clustering, as well as our proposed method, in which samples were first aggregated to the L 2 UniFrac space, followed by k-means clustering on aggregated vectors. We further computed the Fowlkes-Mallows score for each of the instances as a measure of the clustering quality. The comparisons of both the time cost and the clustering quality are shown in Fig. 2. As expected, for the traditional method, clustering time increased exponentially as sample size increases. According to the figure, for as few as 300 samples, the traditional matrix-based method would require $3 h to complete, whereas clustering in L 2 UniFrac space took only a few minutes. This improvement in speed does come with a tradeoff in clustering quality, as suggested by the figure on the right. However, the difference in clustering score is only $0.1 on a scale of 0-1 and remained fairly constant in the experiments performed. In the case of large data size where the exact accuracy is not of top priority, the significant improvement in speed would deem clustering on L 2 UniFrac space much more practical than the traditional "all pairwise UniFrac" approach in real life scenarios.

Finding the representative sample
In this section, we illustrate the process of finding the average sample using L 2 UniFrac specific algorithms.
The illustration was performed using data with study ID 714 from Qiita (Gonzalez et al. 2018), consisting of 528 samples from different environments. We first performed Algorithm 1 on all samples to obtain the aggregated vectors in the L 2 UniFrac space. We then took the component-wise L 2 mean of these vectors. This mean vector was projected back to the distribution space using Algorithm 2, obtaining an average sample of the original samples. PCoA plots were used to  show the relative relationship among these samples. Out of the five environments, we removed one that had too few samples and singled out each environment together with its representative data point to better observe their relationship for the remaining three. The resulting 3D PCoA plots are shown in Supplementary Fig. S2 and the 2D plots using the first two PC are shown in Fig. 3. From the figure, a representative sample corresponds roughly to the centroid of the cluster consisting of the samples belonging to that environment. It turns out that this average sample can be equivalently computed by simply computing the component-wise mean of the original distributions, as shown in the Supplementary Section S1.4.2. This further simplifies and speeds up the computation and is a unique advantage of L 2 UniFrac that cannot be achieved using L 1 UniFrac (see Supplementary Section S1.4.1), making L 2 UniFrac more than simply an alternative version of UniFrac but instead a novel metric having its own unique applications and properties.

Classification
Up to this point, we have demonstrated the process of obtaining the average sample amongst a pool of samples from the same environment. Ideally, this representative sample should be specific to the environment such that there is sufficient degree of differences between the representative samples of two different environments to significantly distinguish one from another. In this section, we test this hypothesis using a classification test using the same dataset from Section 3.1 consisting of 16S samples from four body sites. Typically, when a group of unknown samples are given, clustering is first performed and classes are assigned based on the clustering result, such as based on the minimum distance to the centroid of the clusters, or by majority vote from a certain number of closest points. This type of methods can be time consuming and the unsupervised nature of this method also does not guarantee a direct correspondence between the segregation of the clusters and that of the actual traits of interest. The alternative we propose here is to first find the representative samples corresponding to each environment using a large size of known sample. For any subsequent new sample, its class can be assigned based on similarity with the representative samples. The accuracy of such assignments can therefore be used to gauge the specificity of such representative samples.
To this end, we partitioned the samples of each body site into 80% of training samples and 20% of testing samples. As representatives of current clustering-based methods, we considered k-medoids and k-means as clustering methods. For each of these clustering methods, we first performed clustering, using the precomputed pairwise L 2 UniFrac distance matrix for k-medoids and probability distribution vectors for kmeans, setting the number of clusters equal to the number of body sites present. With 80% of the training data, the label for each cluster was assigned by majority vote. For each testing sample, the labels assigned by clustering were compared against its true label. For our L 2 UniFrac based method, a representative sample was first computed using the training data. We then computed the L 2 UniFrac distance between each of the testing samples and each of the representative samples and assign the test sample to the body site of which the L 2 UniFrac between the representative sample produced the minimum L 2 UniFrac distance. We used different scoring metrics from the sci-kit learn python package to evaluate the performance of all three classification methods. Figure 4 shows one such score, accuracy, used. The performance evaluated using other scores is shown in Supplementary Fig. S1.
The result shows that our method out-performed the other two methods in most cases. The poor performance of kmedoids and k-means can be partially attributed to the fact that the labels assigned by majority vote after clustering were not consistent with the original classes, though the number of clusters were set to be equivalent to the true number of classes. For instance, in some cases, two clusters were assigned the same label by majority vote, while some of the original classes were missing in this process. This significantly affected the results.
Of course, this comparison is between two different kinds of approaches: supervised and unsupervised. In the case of the L 2 UniFrac method, the exact classes were known and the representative samples were created from samples from a known environment. In the case of k-means and k-medoids, however, the clustering was unsupervised, significantly increasing the proportion of false positives and false negatives. On the other hand, this in turn shows the very advantage of our method of having the potential to convert a traditionally more unsupervised method into one with a more supervised nature by creating a "reference point" for each environment.

Identifying differentially abundant organisms
An additional motivation of finding the average sample that characterizes an environment lies in the hope of identifying signature microbiomes that distinguish one environment significantly from another. Traditionally, in order to identify the most differentially abundant organisms, the abundances of each organism are to be compared across all samples. The results can then be visualized commonly using a box and whisker plot showing the average abundances of the organisms in each sampling environment, based on which a conclusion can be drawn on if the differences are significant, with the aid of statistical methods. We propose the alternative method based on the "flow" between average samples obtained through L 2 UniFrac, giving rise to a phylogenyaware comparison between sample groups.
The formulation of UniFrac as an optimum transport problem allows the differences in two samples to be represented as the "flow" between the two distributions. More specifically, for two samples (or averages) P and Q, the vector W ffi Á p P À W ffi Á p Q represents the flow (or flux) across edges in the phylogenetic tree when moving abundance from P to overlap that  Wei et al.
of Q. This vector W ffi Á p P À W ffi Á p Q is called the differential abundance vector [see McClelland and Koslicki (2018) for the L 1 analog]. The larger the absolute value of the flow between two organisms at a certain node in the phylogentic tree, the greater the difference is between the two organisms at this node/taxa. As such, we propose to obtain first the average samples from each environment and then compute the flow between pairwise average samples for the detection of differentially abundant taxa.
To test the feasibility of our method, we used the PRJEB6070 (BioProject 266076) study downloaded from HumanMetagenomeDB (Kasmanas et al. 2021), consisting of 1261 gut samples grouped by three conditions: colorectal cancer, adenoma, and control. Unlike the previous experiments using 16S rRNA data, this dataset consists of whole-genome shotgun (WGS) data. Importantly, Algorithms 1 and 2 can be applied on any given tree, be it a phylogenetic tree, a taxonomic tree, or some other kind of tree. Depending on what sort of data a user has on hand, the choice of what kind of tree to use can vary. We use a taxonomic tree here for demonstration purposes. Using the principle adopted by the WGSUniFrac method (Wei and Koslicki 2022), the average abundance with respect to each taxonomic ID is computed for each condition, giving rise to three representative samples. Using the taxonomic tree in place of the phylogenetic tree with the reciprocal-based assignment of branch lengths as suggested by the WGSUniFrac method (Wei and Koslicki 2022), the pairwise flow among the three representative samples at phlyum level were computed and illustrated in Fig. 5. These plots depict which nodes in the underlying tree (here, a taxonomic tree) contributed the most to the observed L2UniFrac value between environmental averages: those nodes for which the subtree below it had the largest observed differences in abundance between the environments. The bars indicate the absolute value of the difference in the total subtree abundance, colored by which environment was dominant in that subtree. Here, phylum-level resolution was chosen for the cleanliness of representation, as well as for the ease of comparison with other studies done at phylum level. In real cases, the level of resolution can be freely adjustable. To illustrate this functionality, we have provided the same plots at genus level in Supplementary Fig. S3. The bottom right subfigure in Fig. 5 demonstrates that these differential abundance plots reflect trends that could be observed when analyzing the data at a fixed taxonomic rank, but have the advantage of not requiring a fixed rank at the outset, nor additional computation as computing the L2UniFrac returns the differential abundance vectors as part of its computation.
Based on the differential abundance plots, we identified Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria to be some of the phyla with the highest degree of distinguishability across the three conditions. The differential abundances plots show that compared to control, cancer and adenoma conditions tend to have a higher abundance of Bacteroidetes while having a lower abundance in Firmicutes and Actinobacteria. These agree with the box plot in the same as Fig. 5. Interestingly, these phyla had been observed in a previous study (Yachida et al. 2019), supporting the sensitivity of our method. However, though the same phyla were identified to be correlated with the conditions, the trend of correlation  does not seem to agree, with our data showing an increased abundance in Bacteroidetes and decreased abundances in Firmicutes and Actinobacteria in adenoma and cancer conditions compared to the control group. The study carried out by Yachida et al. (2019) demonstrated elevation in all of the above-mentioned phyla. This is likely due to different datasets being compared. Perhaps more studies were to be conducted to reach a decisive conclusion. One thing we do agree, though, is the fact that the observed patterns seem to be progressive, with adenoma falling in between control and cancer, making these phyla strong candidates as indicators of the development of colorectal cancer.

Discussion
In this paper, we proposed L 2 UniFrac as an alternative of the traditional L 1 UniFrac. This L 2 UniFrac preserves the robustness of L 1 UniFrac as a phylogenetic metric for beta-diversity and allows one to compute the average distribution with respect to L 2 UniFrac metric, which is not possible with the original version of UniFrac. Furthermore, we further explored the properties of the L 2 UniFrac space, in which lie the aggregated vectors obtained by aggregating the distributions up the phylogenetic tree. In this space, clustering of metagenomic samples can be performed with much higher efficiency, circumventing the computation of the pairwise distance matrix, with negligible sacrifice of clustering quality. Given the invertibility of the aggregation process, as shown in Supplementary Section S1.2, the L 2 -mean of the vectors in the L 2 UniFrac space can be taken and projected back to the distribution space, giving rise to an average sample with respect to L 2 UniFrac. We further showed that this projected mean is also a probability distribution and is actually none other than the component-wise mean of the metagenomic distributions. This property gives the component-wise average of metagenomic distributions a phylogeny-aware biological interpretation as the L 2 norm of their differences is exactly the L 2 UniFrac value. We also demonstrated some of the potential usage of such average samples through experimentation. Most significantly, such average samples can serve as fingerprints for specific environments, allowing a biologist to better characterize the signature microbes belonging to a specific environment. Our results in Section 3.2 show that despite the between-sample variability even within the same environment, the average sample is sufficient to distinguish one environment from another to a fair degree. The average sample obtained using this method is stable to the environment in the sense that given the same environment and sufficiently large sample size, the variance between different average samples obtained using different sample pools or at different time points can be minimized. The reason is due to the nice property of its equivalence with the component-wise L 2 -mean, making this process equivalent to taking a sample mean. By the Central Limit Theorem, when the sample size is sufficiently large, the sample means will be approximately normally distributed, with a standard deviation inversely proportional to the number of times samples are taken. Of course, the difficulty in defining an environment still remains, such as in the case of disease development, where different stages of disease are often artificially defined based on physiological differences. However, it can be envisioned that by pooling samples from adjacent stages, the average sample of intermediate stages can be computed. Coupled with the method to compute the flow between two distributions as demonstrated in Section 3.2.3, phylogenyaware differential abundance profiles can be obtained, with the potential to provide a trajectory on the dynamic changes in metagenomic diversity with respect to the development of the disease to user-defined resolution.

Supplementary data
Supplementary data is available at Bioinformatics online.

Conflict of interest
None declared.

Funding
This work was supported by the NIH grant 1R01GM146462-01.

Data availability
The code used to generate all data and figures in this manuscript can be found in https://github.com/KoslickiLab/L2-UniFrac-Paper.git and a prototype implementation for L2UniFrac can be found in https://github.com/KoslickiLab/ L2-UniFrac.git.